According to the documentation (https://search.r-project.org/CRAN/refmans/spData/html/boston.html), this dataset contains housing data collected as part of the 1970 census of Boston, Massachusetts. The data frame has 506 rows and 20 columns and contains the corrected data from Harrison and Rubinfeld (1978). Each observation (row) contains a collection of statistics for a single census ‘tract’ (a small geographic region containing multiple houses, defined specifically for a census). Note that MEDV is censored: median values at or over USD 50,000 are set to USD 50,000.
In this study we consider the spatial distribution of the CMEDV variable, which is the median value (in thousands of USD) of owner-occupied housing in each census tract. Each tract is also associated with a point location; the geographic coordinates of this point (in decimal degrees of latitude and longitude) and the town in which it is located (within the Greater Boston area) are provided for each observation.
We are going to derive a smaller dataframe from the above data set that contains only the variables TOWN, LON, LAT and CMEDV:
| TOWN | LON | LAT | CMEDV |
|---|---|---|---|
| Nahant | -70.96 | 42.26 | 24.0 |
| Swampscott | -70.95 | 42.29 | 21.6 |
| Swampscott | -70.94 | 42.28 | 34.7 |
| Marblehead | -70.93 | 42.29 | 33.4 |
| Marblehead | -70.92 | 42.30 | 36.2 |
|  | x |
|---|---|
| TOWN | 0 |
| LON | 0 |
| LAT | 0 |
| CMEDV | 0 |
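As a minimal sketch of this step, the snippet below rebuilds the five rows shown above as a toy data frame, keeps the four variables of interest, and counts missing values per column. The name `boston_full`, the extra `OTHER` column and its values are hypothetical stand-ins for the full data set, and reading the second table as a count of missing values is our assumption.

```r
# Toy stand-in for the full data frame: the five rows shown in the table above,
# plus one hypothetical extra column (OTHER) to show that it gets dropped.
boston_full <- data.frame(
  TOWN  = c("Nahant", "Swampscott", "Swampscott", "Marblehead", "Marblehead"),
  LON   = c(-70.96, -70.95, -70.94, -70.93, -70.92),
  LAT   = c(42.26, 42.29, 42.28, 42.29, 42.30),
  CMEDV = c(24.0, 21.6, 34.7, 33.4, 36.2),
  OTHER = c(1, 2, 3, 4, 5),
  stringsAsFactors = FALSE
)

# Keep only the four variables used in this study.
boston_small <- boston_full[, c("TOWN", "LON", "LAT", "CMEDV")]

# Count missing values in each retained column (assumed meaning of the
# second table above).
na_counts <- colSums(is.na(boston_small))
na_counts
```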
Coordinates
In the map below, we can already see that the points plotted from the latitudes and longitudes do not match the towns.
## Assuming "lon" and "lat" are longitude and latitude, respectively
Coordinates on map
To be sure, we zoom in, and we can clearly see that some of the points appear on the water.
## Assuming "lon" and "lat" are longitude and latitude, respectively
Coordinates on map
Finally, we analyse our data further by inspecting individual towns on Google Maps, in particular Cambridge. First we look up the correct coordinates on Google Maps and add them to our map. The blue dot represents the correct coordinates, while the green dot shows the coordinates in our data. We can clearly see that there is a significant difference between the two.
## Assuming "LON" and "LAT" are longitude and latitude, respectively
Zoom on map
In order to correct the data, we suppose that all coordinates are shifted by a certain amount. We assume that there are \(n_j\) observations in town \(j\), and for each observation \(k\) in town \(j\), we denote the longitudinal coordinate as \(x_{j,k}\), \(k = 1, \dots, n_j\). Then we assume:
\[ x_{j,k} = TC^{(x)}_j + \Delta^{(x)}_{j,k} \]
where \(TC^{(x)}_j\) is the longitudinal coordinate of the center of town \(j\), and \(\Delta^{(x)}_{j,k}\) is the displacement of observation \(k\) in town \(j\) from the town center. We also assume that the latitudinal coordinates (which we denote \(y_{j,k}\)) satisfy a similar relationship. The suggested systematic error is therefore such that \((TC^{(x)}_j, TC^{(y)}_j)\) has been misspecified for \(j = 1, \dots, n\), where \(n\) is the number of towns.
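To make the consequence of this model explicit, here is a short worked step. The symbol \(C^{(x)}_j\) for the correct longitudinal center of town \(j\) is introduced here only for illustration. If each displacement is kept and only the center is replaced, the corrected coordinate \(\tilde{x}_{j,k}\) would be
\[ \tilde{x}_{j,k} = C^{(x)}_j + \Delta^{(x)}_{j,k} = x_{j,k} - TC^{(x)}_j + C^{(x)}_j \]
and similarly for \(y_{j,k}\). Since \(TC^{(x)}_j\) is not observed directly, it has to be estimated. Under the additional assumption that the displacements average to zero within a town, one natural estimate is the town mean \(\frac{1}{n_j} \sum_{k=1}^{n_j} x_{j,k}\).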
To find the displacement, we use the correct center coordinates of each town in the Greater Boston area, which are provided in the file BostonTownCentres.csv. First we take a quick look at the data. Note: the towns in this file are of type character.
## Rows: 92 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): town
## dbl (2): lat, lon
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
| town | lat | lon |
|---|---|---|
| Arlington | 42.41537 | -71.15644 |
| Ashland | 42.26066 | -71.46413 |
| Bedford | 42.49173 | -71.28179 |
| Belmont | 42.39593 | -71.17867 |
| Beverly | 42.55843 | -70.88005 |
Next we use an appropriate mutating join to combine the two data
sets. We check and observe that the number of rows in
boston.c doesn’t match the number of rows in the new
data frame. We find that the missing data corresponds to Saugus, which is
spelled as Sargus in boston.c. As a result, we correct the instances of
Sargus and join the corrected data frame with BostonTownCentres. This
time the row counts match.
## [1] FALSE
## [1] "Sargus"
## [1] TRUE
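The three checks above can be reproduced in miniature. The following is only a sketch: it uses base R's `merge()` as a stand-in for the dplyr join (so it runs without extra packages), assumes an inner join, and works on toy data frames `obs` and `centres` whose Saugus values and observation coordinates are illustrative, not taken from the data.

```r
# Toy observations (illustrative coordinates); "Sargus" is the misspelling
# described above.
obs <- data.frame(
  TOWN = c("Arlington", "Sargus", "Sargus", "Belmont"),
  LON  = c(-71.10, -70.95, -70.96, -71.12),
  LAT  = c(42.25, 42.30, 42.31, 42.24),
  stringsAsFactors = FALSE
)

# Toy town-centre table with the column names of BostonTownCentres.csv.
# The Saugus row is an illustrative value, not the real centre.
centres <- data.frame(
  town = c("Arlington", "Belmont", "Saugus"),
  lat  = c(42.41537, 42.39593, 42.46),
  lon  = c(-71.15644, -71.17867, -71.01),
  stringsAsFactors = FALSE
)

# Inner join on the town name: observations without a match are dropped.
joined <- merge(obs, centres, by.x = "TOWN", by.y = "town")
rows_match_before <- nrow(joined) == nrow(obs)   # FALSE

# Which town names in the observations have no counterpart in the centres?
unmatched <- setdiff(obs$TOWN, centres$town)     # "Sargus"

# Correct the spelling and join again.
obs$TOWN[obs$TOWN == "Sargus"] <- "Saugus"
joined_fixed <- merge(obs, centres, by.x = "TOWN", by.y = "town")
rows_match_after <- nrow(joined_fixed) == nrow(obs)  # TRUE
```

Note that `merge()` sorts the result by the join key, so the row order of `joined_fixed` differs from that of `obs`.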
Next we’re going to visualise the correct coordinates. We can already observe that there are no points on water and that they seem to match the towns on the map.
## Assuming "lon" and "lat" are longitude and latitude, respectively
Correct coordinates on map
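The text above does not spell out how the corrected coordinates were computed, so the following is only one possible sketch. It assumes that the misspecified center of each town can be estimated by the mean of that town's recorded coordinates; the data values and the column names `dx`, `dy`, `LON_corrected` and `LAT_corrected` are hypothetical.

```r
# Toy data for a single town (illustrative values only).
obs <- data.frame(
  TOWN = c("Saugus", "Saugus", "Saugus"),
  LON  = c(-70.96, -70.95, -70.94),
  LAT  = c(42.29, 42.30, 42.31),
  stringsAsFactors = FALSE
)
centres <- data.frame(town = "Saugus", lat = 42.46, lon = -71.01,
                      stringsAsFactors = FALSE)

# ASSUMPTION: the misspecified centre of each town is estimated by the mean
# of the recorded coordinates in that town (ave() computes group means).
old_lon_centre <- ave(obs$LON, obs$TOWN)
old_lat_centre <- ave(obs$LAT, obs$TOWN)

# Displacement of each observation from the estimated old centre.
obs$dx <- obs$LON - old_lon_centre
obs$dy <- obs$LAT - old_lat_centre

# Look up the correct centre for each row and rebuild the coordinates
# around it, keeping the displacements unchanged.
idx <- match(obs$TOWN, centres$town)
obs$LON_corrected <- centres$lon[idx] + obs$dx
obs$LAT_corrected <- centres$lat[idx] + obs$dy
obs
```

With this choice, the corrected points of each town are centered on the supplied town center and keep their relative positions.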
We’re going to zoom into an area to check if everything is in order.
## Assuming "lon" and "lat" are longitude and latitude, respectively
Zoom on correct coordinates on map
Final maps
Finally, we construct a visualisation that shows the spatial distribution of the median value of owner-occupied housing in Greater Boston in 1970. In this instance, we are going to use ggmap. We observe that some towns have only one observation, so we can’t create polygons for them.
## Source : https://maps.googleapis.com/maps/api/staticmap?center=42.36008,-71.05888&zoom=10&size=640x640&scale=2&maptype=terrain&key=xxx
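To see which towns cannot be drawn as polygons, one can count the observations per town. The sketch below does this in base R on the five-row excerpt of town names shown earlier, so the result refers to that excerpt only; the variable names are hypothetical.

```r
# Town names from the five-row excerpt shown earlier.
town <- c("Nahant", "Swampscott", "Swampscott", "Marblehead", "Marblehead")

# Number of observations per town.
n_per_town <- table(town)

# Towns represented by a single point cannot form a polygon.
single_obs_towns <- names(n_per_town)[as.vector(n_per_town) == 1]
single_obs_towns  # "Nahant" in this excerpt
```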
We are going to use the elect80 data set, which contains
the results of the 1980 Presidential election for 3,107 US counties
together with their geographical coordinates. First of all, we use the
FIPS codes to find the exact state and county. For this we use
county.fips, a database in the maps package matching FIPS
codes to maps package county and state names. After matching our two
data frames we drop the remaining columns and keep the region and pc_turnout
variables. To plot the outlines of a geographical region, we use
ggplot2::map_data(). This extracts coordinate data from the maps
package to create a data frame containing the boundaries of one of a
selection of geographical regions. Once we have the coordinates for the
boundaries of our spatial regions, we can match them to the values of
our spatial variable of interest using one of the ‘mutating joins’ from
the dplyr package.
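The matching steps can be sketched on toy data. Everything below is a stand-in: `county_fips` mimics the `fips` and `polyname` columns of the maps database, `elect` mimics the election data (the `FIPS` column name and the turnout values are assumptions), and `boundaries` mimics the shape of a map_data() result with made-up coordinates. Base R's `merge()` is used in place of a dplyr join so that the sketch runs without extra packages.

```r
# Toy stand-in for the FIPS lookup table: FIPS code and "state,county" name.
county_fips <- data.frame(
  fips     = c(1001L, 1003L),
  polyname = c("alabama,autauga", "alabama,baldwin"),
  stringsAsFactors = FALSE
)

# Toy stand-in for the election data (hypothetical turnout values).
elect <- data.frame(FIPS = c(1001L, 1003L), pc_turnout = c(0.6, 0.5))

# Match election results to county names via the FIPS code.
matched <- merge(elect, county_fips, by.x = "FIPS", by.y = "fips")

# Split "state,county" into the two name columns used by the boundary data.
parts <- strsplit(matched$polyname, ",", fixed = TRUE)
matched$region    <- vapply(parts, `[`, character(1), 1)
matched$subregion <- vapply(parts, `[`, character(1), 2)

# Toy boundary table shaped like a county-level map_data() result
# (made-up coordinates, two boundary points per county).
boundaries <- data.frame(
  long      = c(-86.5, -86.4, -87.9, -87.8),
  lat       = c(32.4, 32.5, 30.6, 30.7),
  region    = "alabama",
  subregion = c("autauga", "autauga", "baldwin", "baldwin"),
  stringsAsFactors = FALSE
)

# Attach the turnout to every boundary point of the matching county.
plot_data <- merge(boundaries,
                   matched[, c("region", "subregion", "pc_turnout")],
                   by = c("region", "subregion"), all.x = TRUE)
plot_data
```

With real boundary data, the rows should be re-sorted by the boundary point order after merging, since `merge()` does not preserve row order and polygons are drawn in row order.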